 generate music


Bias beyond Borders: Global Inequalities in AI-Generated Music

arXiv.org Artificial Intelligence

While recent years have seen remarkable progress in music generation models, research on their biases across countries, languages, cultures, and musical genres remains underexplored. This gap is compounded by the lack of datasets and benchmarks that capture the global diversity of music. To address these challenges, we introduce GlobalDISCO, a large-scale dataset consisting of 73k music tracks generated by state-of-the-art commercial generative music models, along with paired links to 93k reference tracks in LAION-DISCO-12M. The dataset spans 147 languages and includes musical style prompts extracted from MusicBrainz and Wikipedia. The dataset is globally balanced, representing musical styles from artists across 79 countries and five continents. Our evaluation reveals large disparities in music quality and alignment with reference music between high-resource and low-resource regions. Furthermore, we find marked differences in model performance between mainstream and geographically niche genres, including cases where models generate music for regional genres that more closely align with the distribution of mainstream styles.
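The regional disparity described above comes down to comparing a per-track score across groups of regions. Here is a minimal sketch, assuming each generated track has a single numeric score (for example, quality or alignment with its reference tracks) and a hypothetical "high" or "low" resource label. The group labels, the metric, and the toy numbers are illustrative assumptions, not GlobalDISCO's actual protocol or data.

```python
from statistics import mean


def disparity_by_group(records):
    """Average a per-track score within each group and report the gap.

    `records` is an iterable of (group, score) pairs. `group` is a
    hypothetical label such as "high" or "low" resource region, and `score`
    is any per-track quality or reference-alignment metric.
    """
    groups = {}
    for group, score in records:
        groups.setdefault(group, []).append(score)
    means = {group: mean(scores) for group, scores in groups.items()}
    # A positive gap means tracks for high-resource regions score higher on average.
    gap = means["high"] - means["low"]
    return means, gap


# Toy values for illustration only; these are NOT results from GlobalDISCO.
toy_records = [("high", 1.0), ("high", 0.5), ("low", 0.5), ("low", 0.0)]
means, gap = disparity_by_group(toy_records)
print(means, gap)  # {'high': 0.75, 'low': 0.25} 0.5
```

A real analysis would also report spread or significance, because a gap in means alone can be driven by a few outlier tracks.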


Integrating Text-to-Music Models with Language Models: Composing Long Structured Music Pieces

arXiv.org Artificial Intelligence

The SS matrices are downsampled to 5 × 5. The results indicate that, compared to MusicGen, our method produces samples that more closely resemble the Pond5 samples in terms of long-term temporal consistency and the diversity of recurring sections.

The new wave of generative models has been explored in the literature to generate music. Jukebox [1] uses hierarchical VQ-VAEs [2] to generate multiple minutes of music. Jukebox is one of the earliest purely learning-based models that could generate more than one minute of music with some degree of structural coherence. Notably, the authors mention that the generated music is coherent at a small scale of several seconds, but at a larger scale, beyond one minute, it lacks musical form. [...] learn musical structures and forms at all scales. However, none of the models in the literature has demonstrated musical [...]
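The "SS matrices" in this excerpt are presumably self-similarity matrices. This reading is an assumption, but these matrices are a standard way to expose repeated sections in music. The sketch below assumes cosine similarity between feature frames and plain block averaging to shrink the matrix to 5 × 5. The paper's actual features and pooling may differ.

```python
import math


def cosine(u, v):
    """Cosine similarity between two feature vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def self_similarity(frames):
    """Entry (i, j) is the similarity between frame i and frame j."""
    return [[cosine(fi, fj) for fj in frames] for fi in frames]


def downsample(matrix, size=5):
    """Average-pool an n x n matrix (n >= size) into a size x size matrix."""
    n = len(matrix)
    # Integer block edges; each block holds at least one row/column when n >= size.
    edges = [k * n // size for k in range(size + 1)]
    out = []
    for bi in range(size):
        row = []
        for bj in range(size):
            block = [matrix[i][j]
                     for i in range(edges[bi], edges[bi + 1])
                     for j in range(edges[bj], edges[bj + 1])]
            row.append(sum(block) / len(block))
        out.append(row)
    return out


# Toy "song" with sections A B A B A, two frames each (hypothetical 2-D features).
A, B = [1.0, 0.0], [0.0, 1.0]
frames = [A, A, B, B, A, A, B, B, A, A]
small = downsample(self_similarity(frames), size=5)
print(small[0])  # [1.0, 0.0, 1.0, 0.0, 1.0]
```

In the 5 × 5 view, each returning section appears as an off-diagonal block of high similarity. This is the kind of recurring-section pattern the comparison with MusicGen and Pond5 looks at.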


Mozart's Touch: A Lightweight Multi-modal Music Generation Framework Based on Pre-Trained Large Models

arXiv.org Artificial Intelligence

In recent years, AI-Generated Content (AIGC) has advanced rapidly, facilitating the generation of music, images, and other forms of artistic expression across various industries. However, research on general multi-modal music generation models remains scarce. To fill this gap, we propose Mozart's Touch, a multi-modal music generation framework that generates music aligned with cross-modal inputs such as images, videos, and text. Mozart's Touch is composed of three main components: a Multi-modal Captioning Module, a Large Language Model (LLM) Understanding & Bridging Module, and a Music Generation Module. Unlike traditional approaches, Mozart's Touch requires no training or fine-tuning of pre-trained models, offering efficiency and transparency through clear, interpretable prompts. We also introduce the "LLM-Bridge" method to resolve the heterogeneous representation problem between descriptive texts of different modalities. We conduct a series of objective and subjective evaluations of the proposed model, and the results indicate that it surpasses the performance of current state-of-the-art models. Our code and examples are available at: https://github.com/WangTooNaive/MozartsTouch
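The three-module design can be pictured as function composition over frozen components. This is a minimal sketch in which each module is a hypothetical callable supplied by the caller. These stand-ins are not the MozartsTouch repository's API. They only show why no training is needed and why the intermediate prompts stay human-readable.

```python
from dataclasses import dataclass
from typing import Callable, Tuple


@dataclass
class TrainingFreePipeline:
    """Chains three frozen components; nothing here is trained or fine-tuned."""

    caption: Callable[[bytes], str]   # hypothetical stand-in: Multi-modal Captioning Module
    bridge: Callable[[str], str]      # hypothetical stand-in: LLM Understanding & Bridging Module
    generate: Callable[[str], bytes]  # hypothetical stand-in: Music Generation Module

    def run(self, media: bytes) -> Tuple[str, str, bytes]:
        description = self.caption(media)        # image/video -> descriptive text
        music_prompt = self.bridge(description)  # descriptive text -> music-oriented prompt
        audio = self.generate(music_prompt)      # prompt -> audio bytes
        # Returning the intermediate texts keeps every step inspectable.
        return description, music_prompt, audio


pipeline = TrainingFreePipeline(
    caption=lambda media: "a quiet beach at sunset",
    bridge=lambda text: f"calm ambient piano evoking {text}",
    generate=lambda prompt: prompt.encode("utf-8"),  # placeholder for real audio bytes
)
print(pipeline.run(b"fake-image-bytes")[1])  # calm ambient piano evoking a quiet beach at sunset
```

Swapping in a different captioner or music model only changes one constructor argument. That is the practical benefit of a training-free, prompt-bridged design.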


ERNIE-Music: Text-to-Waveform Music Generation with Diffusion Models

arXiv.org Artificial Intelligence

In recent years, growing interest in diffusion models has led to significant advances in image and speech generation. Nevertheless, directly synthesizing music waveforms from unrestricted textual prompts remains relatively underexplored. To address this gap, this paper introduces a text-to-waveform music generation model based on diffusion models. Our method uses free-form textual prompts as conditions that guide the waveform generation process within the diffusion framework. To address the scarcity of parallel text-music data, we build a dataset from web resources using weak supervision. We also empirically compare two prompt formats for text conditioning: music tags and unconstrained textual descriptions. The comparison confirms that our proposed model achieves superior text-music relevance. Finally, we demonstrate the model's text-to-music generation capabilities and show that our generated music in the waveform domain outperforms previous works by a large margin in terms of diversity, quality, and text-music relevance.
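A text-conditioned waveform diffusion model is trained by corrupting clean audio with noise and teaching a network to undo it. The sketch below shows only the generic forward (noising) step of a DDPM, using that paper's linear schedule from 1e-4 to 0.02 over 1,000 steps. It is an illustrative assumption, not ERNIE-Music's exact parameterization, and the denoiser that consumes the text embedding is left out.

```python
import math
import random


def linear_betas(steps, beta_start=1e-4, beta_end=0.02):
    """Linear noise schedule (steps >= 2), as in the original DDPM paper."""
    return [beta_start + (beta_end - beta_start) * t / (steps - 1) for t in range(steps)]


def cumulative_alphas(betas):
    """alpha_bar_t = prod_{s <= t} (1 - beta_s): how much signal survives at step t."""
    out, prod = [], 1.0
    for beta in betas:
        prod *= 1.0 - beta
        out.append(prod)
    return out


def q_sample(x0, t, alpha_bar, rng):
    """Draw x_t ~ q(x_t | x_0) = N(sqrt(alpha_bar_t) * x_0, (1 - alpha_bar_t) * I)."""
    eps = [rng.gauss(0.0, 1.0) for _ in x0]
    a = alpha_bar[t]
    xt = [math.sqrt(a) * x + math.sqrt(1.0 - a) * e for x, e in zip(x0, eps)]
    return xt, eps


# A tiny four-sample "waveform" stands in for real audio.
alpha_bar = cumulative_alphas(linear_betas(1000))
rng = random.Random(0)
xt, eps = q_sample([0.0, 0.5, 0.0, -0.5], t=500, alpha_bar=alpha_bar, rng=rng)
# Training (not shown): a denoiser f(x_t, t, text_embedding) learns to predict eps.
# The text prompt (tags or free-form description) enters generation at this point.
```

The tag-versus-description comparison in the abstract changes only what produces `text_embedding`. The noising process itself is the same in both cases.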


Three ways AI is transforming music

AIHub

Each fall, I begin my course on the intersection of music and artificial intelligence by asking my students if they're concerned about AI's role in composing or producing music. So far, the question has always elicited a resounding "yes." Their fears can be summed up in a sentence: AI will create a world where music is plentiful, but musicians get cast aside. In the upcoming semester, I'm anticipating a discussion about Paul McCartney, who in June 2023 announced that he and a team of audio engineers had used machine learning to uncover a "lost" vocal track of John Lennon by separating the instruments from a demo recording. But resurrecting the voices of long-dead artists is just the tip of the iceberg in terms of what's possible – and what's already being done. In an interview, McCartney admitted that AI represents a "scary" but "exciting" future for music.


A Survey of AI Music Generation Tools and Models

arXiv.org Artificial Intelligence

In this work, we provide a comprehensive survey of AI music generation tools, including both research projects and commercial applications. To conduct our analysis, we classified music generation approaches into three categories: parameter-based, text-based, and visual-based. Our survey highlights the diverse possibilities and functional features of these tools, which cater to a wide range of users, from regular listeners to professional musicians. We observed that each tool has its own set of advantages and limitations. As a result, we have compiled a comprehensive list of the factors to consider when selecting a tool. Moreover, our survey offers critical insights into the underlying mechanisms and challenges of AI music generation.


MuseCoco: Generating Symbolic Music from Text

arXiv.org Artificial Intelligence

Generating music from text descriptions is user-friendly, since text is a relatively easy interface for user engagement. While some approaches use text to control music audio generation, editing musical elements in generated audio is challenging for users. In contrast, symbolic music is easy to edit, which makes it more accessible for users to manipulate specific musical elements. In this paper, we propose MuseCoco, which generates symbolic music from text descriptions, using musical attributes as a bridge to break the task into two stages: text-to-attribute understanding and attribute-to-music generation. MuseCoco stands for Music Composition Copilot; it empowers musicians to generate music directly from given text descriptions, offering a significant improvement in efficiency compared to creating music entirely from scratch. The system has two main advantages. First, it is data efficient. In the attribute-to-music generation stage, the attributes can be extracted directly from music sequences, making the model training self-supervised. In the text-to-attribute understanding stage, the text is synthesized and refined by ChatGPT based on the defined attribute templates. Second, the system achieves precise control with specific attributes in text descriptions and offers multiple control options through attribute-conditioned or text-conditioned approaches. MuseCoco outperforms baseline systems in musicality, controllability, and overall score by at least 1.27, 1.08, and 1.32, respectively. Objective control accuracy also improves notably, by about 20%. In addition, we have developed a robust large-scale model with 1.2 billion parameters, showcasing exceptional controllability and musicality.
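The self-supervised attribute-to-music stage depends on labels that can be computed from the music itself. This is a minimal sketch of that idea under assumed conventions: notes are (onset, MIDI pitch, duration, instrument) tuples, and the attribute names and tempo thresholds are hypothetical. MuseCoco's actual attribute set and templates are defined in the paper. The template function stands in for the text that, per the abstract, ChatGPT then refines.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Note:
    onset_beats: float
    pitch: int            # MIDI pitch number
    duration_beats: float
    instrument: str


def extract_attributes(notes, tempo_bpm, beats_per_bar=4):
    """Derive coarse attribute labels directly from a note sequence.

    Because the labels come from the music itself, (attributes, music)
    training pairs need no human annotation.
    """
    pitches = [n.pitch for n in notes]
    end = max(n.onset_beats + n.duration_beats for n in notes)
    # Hypothetical tempo buckets, chosen only for illustration.
    if tempo_bpm < 76:
        tempo = "slow"
    elif tempo_bpm < 120:
        tempo = "moderate"
    else:
        tempo = "fast"
    return {
        "instruments": sorted({n.instrument for n in notes}),
        "pitch_range": max(pitches) - min(pitches),
        "bars": math.ceil(end / beats_per_bar),
        "tempo": tempo,
    }


def attributes_to_text(attrs):
    """Fill a fixed template; this is the raw text a language model could then refine."""
    return (f"A {attrs['tempo']} piece for {', '.join(attrs['instruments'])}, "
            f"{attrs['bars']} bars long, spanning {attrs['pitch_range']} semitones.")


notes = [Note(0, 60, 1, "piano"), Note(1, 67, 1, "piano"), Note(4, 72, 2, "violin")]
attrs = extract_attributes(notes, tempo_bpm=100)
print(attributes_to_text(attrs))
# A moderate piece for piano, violin, 2 bars long, spanning 12 semitones.
```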


Inside the music industry's battle with the UK government over AI song generators

#artificialintelligence

Universal Music Group has been asking music streaming services like Spotify to stop developers from scraping its material to train AI bots to make new songs. The label, which controls about a third of the recorded music industry, has also been issuing substantial numbers of takedown requests in relation to AI uploads appearing online. It is the latest move in the music industry's growing battle to prevent AIs from using its songs without licensing them. On a "royalty free music generator" like Mubert, it's already possible to type in a prompt and the programme will use AI to search a catalogue of music for patterns. Tell it to play a "fast voodoo rhythm in the style of a nursery rhyme with some pretty electronics", and it will copy parts of songs that correspond and generate music to match.


Unleashing the Power of AI in Music: A Deep Dive into Jukebox by OpenAI

#artificialintelligence

Jukebox, an innovative AI system created by OpenAI, uses deep learning to generate music, complete with lyrics and vocals, in a variety of genres and styles. Trained on a dataset of 1.2 million songs, Jukebox showcases an unparalleled level of sophistication in music generation, pushing the boundaries of what AI can achieve in the creative arts. At the core of Jukebox lies a neural network architecture known as a Vector Quantized Variational Autoencoder (VQ-VAE). The VQ-VAE encodes raw audio from the training dataset into sequences of discrete codes and decodes such codes back into audio. Jukebox generates novel and diverse musical compositions by sampling new code sequences from autoregressive Transformer priors trained over this compressed representation, then decoding them into audio.
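OpenAI's Jukebox paper describes the autoencoder as a vector-quantized VAE (VQ-VAE). Its defining step snaps each continuous encoder output to the nearest entry of a learned codebook, which turns audio into a sequence of integers that a Transformer can model. Below is a minimal sketch of that quantization step, with a tiny hand-written codebook standing in for a learned one.

```python
def squared_distance(u, v):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def quantize(latents, codebook):
    """Map each continuous latent vector to the index of its nearest codebook entry."""
    codes = []
    for z in latents:
        nearest = min(range(len(codebook)), key=lambda k: squared_distance(z, codebook[k]))
        codes.append(nearest)
    return codes


def dequantize(codes, codebook):
    """Look the discrete codes back up; a decoder network would turn these into audio."""
    return [codebook[k] for k in codes]


# Hypothetical 2-D codebook and encoder outputs, for illustration only.
codebook = [[0.0, 0.0], [1.0, 1.0], [-1.0, 2.0]]
latents = [[0.1, -0.2], [0.9, 1.2], [-0.8, 1.7]]
codes = quantize(latents, codebook)
print(codes)  # [0, 1, 2]
```

In Jukebox, generation runs in the code space. The priors sample a new integer sequence, and the decoder maps the corresponding codebook vectors back to audio.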


Google Unveils MusicLM, an AI That Can Generate Music from Text Prompts

#artificialintelligence

Google researchers have introduced MusicLM, an AI model that can generate high-fidelity music from text. MusicLM casts conditional music generation as a hierarchical sequence-to-sequence modeling task and generates music at 24 kHz that remains consistent over several minutes. According to the research paper, MusicLM was trained on a dataset of 280,000 hours of music so that it can produce songs that match complex descriptions. The researchers also claim that their model outperforms previous systems in both audio quality and adherence to the text description. MusicLM samples include five-minute pieces generated from just one or two words, such as "melodic techno", as well as 30-second samples that sound like entire songs, generated from paragraph-long descriptions that prescribe a genre, vibe, and even specific instruments.
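A quick calculation shows why hierarchical modeling matters at this scale. Five minutes of 24 kHz audio is millions of samples, so a model first generates much shorter token sequences. This is a minimal sketch: the token rates and the stage callables are hypothetical assumptions chosen for illustration, not figures or interfaces taken from the article.

```python
def sequence_lengths(duration_s, sample_rate_hz=24_000, semantic_hz=25,
                     acoustic_hz=50, acoustic_levels=12):
    """Count how many elements each representation needs for `duration_s` seconds.

    The token rates are illustrative assumptions, not figures from the article.
    """
    return {
        "audio_samples": duration_s * sample_rate_hz,
        "semantic_tokens": duration_s * semantic_hz,
        "acoustic_tokens": duration_s * acoustic_hz * acoustic_levels,
    }


def generate_hierarchically(text, semantic_model, acoustic_model, decoder):
    """Coarse-to-fine generation with hypothetical stage callables."""
    semantic = semantic_model(text)            # stage 1: short sequence, long-term structure
    acoustic = acoustic_model(text, semantic)  # stage 2: fine detail, conditioned on stage 1
    return decoder(acoustic)                   # stage 3: tokens -> 24 kHz waveform


print(sequence_lengths(300))
# {'audio_samples': 7200000, 'semantic_tokens': 7500, 'acoustic_tokens': 180000}
```

Under these assumed rates, the first stage handles a sequence nearly a thousand times shorter than the raw waveform. That makes minute-scale consistency easier to model.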